Library Imports
```python
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
```
Template
```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.9 - Filling in Null Values")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

pets = spark.read.csv(path, header=True)
pets.toPandas()
```
 | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white
4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None
Filling in Null Values
When working with real-world data, some values may be missing but can be inferred from other columns or replaced with sensible defaults. Here we will look at a few ways to fill in those missing values.
Option 1 - Fill in All Missing Values with a Default Value
```python
(
    pets
    .fillna('Unknown')
    .toPandas()
)
```
 | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | Unknown
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | Unknown
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white
4 | 4 | 2 | Unknown | 2019-01-01 10:05:10 | 13 | Unknown
What Happened?
Using `fillna`, we attempt to fill in all `null` values with the value `'Unknown'`. This works when the `null` values live in string columns, since the fill value is only applied to columns whose type matches it; a missing value in a numeric column such as `age` would be left untouched. We look at how to specify different values for different columns next.
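The type-matching rule above can be sketched in plain Python (this is an illustration of the behaviour, not Spark code; the `schema` and `fill_row` names are made up for the example):

```python
# Illustrative sketch of fillna's type-matching rule: a fill value only
# replaces None cells in columns whose declared type matches the fill value.
schema = {"nickname": str, "age": int, "color": str}  # hypothetical schema

def fill_row(row, value):
    """Fill None cells only where the column's type matches type(value)."""
    return {col: value if cell is None and schema[col] is type(value) else cell
            for col, cell in row.items()}

row = {"nickname": None, "age": None, "color": None}
filled = fill_row(row, "Unknown")
# nickname and color are filled; age stays None because "Unknown" is not an int
```

Passing a string fills only the string columns, and passing a number would fill only `age`, which is why a single fill value is often not enough.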
Option 2 - Fill in All Missing Values with a Mapping
```python
(
    pets
    .fillna({
        'nickname': 'Unknown Nickname',
        'color': 'Unknown Color',
    })
    .toPandas()
)
```
 | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | Unknown Color
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | Unknown Color
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white
4 | 4 | 2 | Unknown Nickname | 2019-01-01 10:05:10 | 13 | Unknown Color
What Happened?
You also have the option of filling in each column with a different value. This provides more flexibility: most of the time the columns will have different types, so a single default value won't be sufficient.
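The mapping form can likewise be sketched in plain Python (illustrative only; `fill_row_with_mapping` is a made-up name): each column named in the mapping gets its own default, and columns not in the mapping are left untouched even when they are null.

```python
# Illustrative sketch of dict-based filling: per-column defaults,
# with unmapped columns left as-is even if they are None.
defaults = {"nickname": "Unknown Nickname", "color": "Unknown Color"}

def fill_row_with_mapping(row, defaults):
    """Replace None cells using per-column defaults from the mapping."""
    return {col: defaults[col] if cell is None and col in defaults else cell
            for col, cell in row.items()}

filled = fill_row_with_mapping(
    {"nickname": None, "color": "brown", "age": None}, defaults
)
# nickname gets its default, color keeps "brown", age stays None (not mapped)
```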
Option 3 - coalesce()
```python
(
    pets
    .withColumn('bogus', F.coalesce(F.col('color'), F.col('nickname'), F.lit('Default')))
    .toPandas()
)
```
 | id | breed_id | nickname | birthday | age | color | bogus
---|---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | Argus
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | Chewie
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | white
4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None | Default
What Happened?
Another way to fill in a column is with `coalesce()`. This function looks at its arguments in order from left to right and returns the first one that is not null. If all else fails, you can provide a "default" value as your last argument (remembering that it should be a columnar expression, hence the `F.lit()`).

In our example, it will attempt to fill in the `bogus` column with values from the `color` column first; if that is null it tries the `nickname` column next, and if both are null it uses the default value `Default`.
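The left-to-right, first-non-null semantics can be sketched in plain Python (illustrative only, using `None` in place of SQL null):

```python
# Illustrative sketch of coalesce() semantics: scan the arguments
# left to right and return the first one that is not None.
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for v in values:
        if v is not None:
            return v
    return None

# Mirroring the pets example: color first, then nickname, then a literal default.
coalesce("brown", "King", "Default")   # row 0: color is present
coalesce(None, "Argus", "Default")     # row 1: falls through to nickname
coalesce(None, None, "Default")        # row 4: falls through to the default
```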
Summary
- We looked at a generic way of filling in columns with a single default value.
- We looked at providing a `{column: value}` mapping to fill in each column separately.
- Lastly, we looked at how to fill in a column using other columns and default values in order of precedence.